Include nltk.download() in benchmark.prepare_backend() (webarena / visualwebarena) #224

Merged
merged 1 commit into main from gasse/patch_58
Oct 30, 2024

Conversation

gasse
Collaborator

@gasse gasse commented Oct 30, 2024

No description provided.
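Going only by the PR title, the change appears to make `prepare_backend()` fetch NLTK data for the webarena and visualwebarena backends. What follows is a minimal sketch of that idea, not the merged code: the function signature, the `NLTK_BACKENDS` set, the injectable `download` argument, and the `"punkt"` resource name are all assumptions made for illustration. Only `nltk.download()` and its `quiet` / `raise_on_error` keyword arguments come from the NLTK library itself.

```python
from typing import Callable, Optional

# Hypothetical: backends that need NLTK data, named after the PR title.
NLTK_BACKENDS = {"webarena", "visualwebarena"}


def _nltk_download(resource: str) -> bool:
    # Imported lazily so that other backends do not require nltk to be installed.
    import nltk

    # quiet=True suppresses progress output; raise_on_error=True turns a failed
    # download into an exception instead of a silent False.
    return nltk.download(resource, quiet=True, raise_on_error=True)


def prepare_backend(
    backend: str,
    resource: str = "punkt",  # assumed resource name, not confirmed by the PR
    download: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Return True if an NLTK download was attempted for this backend."""
    if backend not in NLTK_BACKENDS:
        return False
    (download or _nltk_download)(resource)
    return True
```

Passing a stub as `download` lets the branching logic be exercised without network access; with the default, the call reaches `nltk.download()`.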

@gasse gasse changed the title Update webarena / visualwebarena prepare_backend() Update to webarena / visualwebarena prepare_backend() Oct 30, 2024
@gasse gasse changed the title Update to webarena / visualwebarena prepare_backend() Include nltk.download() in benchmark.prepare_backend() (webarena / visualwebarena) Oct 30, 2024
@gasse gasse merged commit 050c715 into main Oct 30, 2024
13 checks passed
@gasse gasse deleted the gasse/patch_58 branch October 30, 2024 19:24
qipeng pushed a commit to orby-ai-engineering/BrowserGym that referenced this pull request Nov 20, 2024
qipeng added a commit to orby-ai-engineering/BrowserGym that referenced this pull request Jan 18, 2025
* Patch VWA task IDs

* Add BLIP2 evaluator; patch timeout

* Actually add the captioning_fn into evaluator constructor

* downgrading ubuntu version for github tests (ServiceNow#179)

* making webarena tests not run on PRs (ServiceNow#181)

* making webarena tests not run on PRs

* making visualwebarena tests not run on PRs

* SoM bugfix (ServiceNow#185)

* version bump v0.8.1

* workflow image downgrade: ubuntu-latest -> ubuntu-22.04

* support custom observation

* add user data dir

* Benchmarks (ServiceNow#173)

* new ControlOrMeta key modifier (ServiceNow#187)

* Multi-tab fix (ServiceNow#188)

* Global demo_mode flag (ServiceNow#177)

* HighLevelActionSetArgs default value (ServiceNow#191)

* version bump v0.9.0

* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)

* Benchmarks update (ServiceNow#197)

* Miniwob number of seeds 10 -> 5

* remove most benchmark variants

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* New benchmark AssistantBench (ServiceNow#186)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)


---------

Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>

* Fixing logging with multiple jobs (ServiceNow#182)

* Benchmark updates (ServiceNow#199)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump 0.10.0

* README update (ServiceNow#200)

* Train / test splits for workarena-l2/l3 (ServiceNow#203)

* Fine-grained benchmark action sets (ServiceNow#202)

* version bump v0.10.1

* Update README.md

* Update README.md

* Benchmark.prepare_backend() (ServiceNow#204)

* save_step_info bugfix (obs=None) (ServiceNow#207)

* version bump v0.10.2

* full_reset fixes (ServiceNow#209)

* Hide all bids from obs (ServiceNow#212)

* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Leaner Unicode() gym space (ServiceNow#218)

* a method to get the status of an experiment (ServiceNow#219)

* version bump v0.11.0

* Rename benchmark after subset_from_split() (ServiceNow#221)

* exp_dir sanitization (ServiceNow#222)

* get_step_info() bugfix (ServiceNow#223)

* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)

* Benchmark dependencies (ServiceNow#220)

* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)

* version bump v0.11.1

* ExpResult.status minor fix (ServiceNow#225)

* version bump 0.11.2

* Fix duplicate depends_on in webarena metadata (ServiceNow#228)

* Duplicate webarena dependencies fix (ServiceNow#229)

* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)

* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)

* webarena_tiny (ServiceNow#232)

* Set ExpArgs.exp_id at post-init time (ServiceNow#231)

* Remove ARIA extraction warnings (ServiceNow#233)

* Update README.md

* Update README.md

* Update README.md

* version bump v0.11.3

* ci tests fix (ServiceNow#234)

* Benchmark update for weblinx (ServiceNow#235)

* Refactor ExpArgs.exp_id generation (ServiceNow#236)

* VisualWebArena task dependencies (ServiceNow#237)

* VWA dependencies fix (ServiceNow#239)

* VWA evaluator fix, missing captioning_fn (ServiceNow#240)

* version bump v0.12.0

* Update README.md

* VWA hide huggingface progress bar (ServiceNow#241)

* WebLINX pre-download data in prepare_backend() (ServiceNow#226)

* AssistantBench + WebLINX fixes (ServiceNow#244)

* Increase assistantbench max_steps to 30

* Setting AssistantBench locale and timezone

* Dedicated AssistantBench action set

* small fix

* missing change

* Lenient frame marking in last retry (ServiceNow#245)

* WA / VWA default action set update (ServiceNow#247)

* version bump v0.13.0

* visualwebarena massage (ServiceNow#248)

* Minor fix (ServiceNow#250)

* Remove gym warnings "obs not within observation space" (ServiceNow#251)

* Lower trace level info -> debug (ServiceNow#252)

* Make env.close() usable after failure (finally block) (ServiceNow#253)

* add init script support

* VWA / WA updates (ServiceNow#254)

* Minor refactors (ServiceNow#255)

* Optional method AbstractBrowserTask.teardown()

* browsergym registration refactor

* Deal with problematic frame unmarking (ServiceNow#256)

* Add missing property exception to _get_obs() retry (ServiceNow#258)

* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)

* Massage WebArena instance (ServiceNow#259)

* Refactor AssistantBench output directories (ServiceNow#242)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump v0.13.1

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Authors update (ServiceNow#260)

* TapeAgents export for experiment results (ServiceNow#238)

* Update README.md

* Cleanup

* Add weblinx_browsergym as a dependency (ServiceNow#261)

* Typo fix (ServiceNow#264)

* Update requirements.txt to latest libvisualwebarena package that includes local hosting (ServiceNow#165)

* adding AgentInfo to __init__ for convenience (ServiceNow#166)

* libvisualwebarena==0.0.14 (ServiceNow#171)

fixed the jsons file!

* Leaner traces (ServiceNow#169)

* images aren't saved in pkl files anymore, and are stuffed back in at load time

* added kwargs to control img/som saving

* saving as png, adding screenshots back into obs

* retrocompatibility for image loading

* making get_screenshots work for png and jpg

* fixing image types and closing files

* Goal refactor to allow for local image files (ServiceNow#110)


---------

Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump 0.8.0

* Integrate AgentLab tests (ServiceNow#176)

* downgrading ubuntu version for github tests (ServiceNow#179)

* making webarena tests not run on PRs (ServiceNow#181)

* making webarena tests not run on PRs

* making visualwebarena tests not run on PRs

* SoM bugfix (ServiceNow#185)

* version bump v0.8.1

* workflow image downgrade: ubuntu-latest -> ubuntu-22.04

* Benchmarks (ServiceNow#173)

* new ControlOrMeta key modifier (ServiceNow#187)

* Multi-tab fix (ServiceNow#188)

* Global demo_mode flag (ServiceNow#177)

* HighLevelActionSetArgs default value (ServiceNow#191)

* version bump v0.9.0

* Reverting workarena_l1 benchmark to original seed sampling (ServiceNow#198)

* Benchmarks update (ServiceNow#197)

* Miniwob number of seeds 10 -> 5

* remove most benchmark variants

---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* New benchmark AssistantBench (ServiceNow#186)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Default `browsergym_split` metadata for every benchmark (ServiceNow#190)


---------

Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>

* Fixing logging with multiple jobs (ServiceNow#182)

* Benchmark updates (ServiceNow#199)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump 0.10.0

* README update (ServiceNow#200)

* Train / test splits for workarena-l2/l3 (ServiceNow#203)

* Fine-grained benchmark action sets (ServiceNow#202)

* version bump v0.10.1

* Update README.md

* Update README.md

* Benchmark.prepare_backend() (ServiceNow#204)

* save_step_info bugfix (obs=None) (ServiceNow#207)

* version bump v0.10.2

* full_reset fixes (ServiceNow#209)

* Hide all bids from obs (ServiceNow#212)

* Adding weblinx config to DEFAULT_BENCHMARKS (ServiceNow#208)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* Leaner Unicode() gym space (ServiceNow#218)

* a method to get the status of an experiment (ServiceNow#219)

* version bump v0.11.0

* Rename benchmark after subset_from_split() (ServiceNow#221)

* exp_dir sanitization (ServiceNow#222)

* get_step_info() bugfix (ServiceNow#223)

* Set webarena / visualwebarena max steps to 30 (ServiceNow#214)

* Benchmark dependencies (ServiceNow#220)

* Include nltk.download() in benchmark.prepare_backend() for webarena / visualwebarena (ServiceNow#224)

* version bump v0.11.1

* ExpResult.status minor fix (ServiceNow#225)

* version bump 0.11.2

* Fix duplicate depends_on in webarena metadata (ServiceNow#228)

* Duplicate webarena dependencies fix (ServiceNow#229)

* nltk.download() during import for webarena and visualwebarena (ServiceNow#227)

* Refactor full_reset() for webarena / visualwebarena (ServiceNow#230)

* webarena_tiny (ServiceNow#232)

* Set ExpArgs.exp_id at post-init time (ServiceNow#231)

* Remove ARIA extraction warnings (ServiceNow#233)

* Update README.md

* Update README.md

* Update README.md

* version bump v0.11.3

* ci tests fix (ServiceNow#234)

* Benchmark update for weblinx (ServiceNow#235)

* Refactor ExpArgs.exp_id generation (ServiceNow#236)

* VisualWebArena task dependencies (ServiceNow#237)

* VWA dependencies fix (ServiceNow#239)

* VWA evaluator fix, missing captioning_fn (ServiceNow#240)

* version bump v0.12.0

* Update README.md

* VWA hide huggingface progress bar (ServiceNow#241)

* WebLINX pre-download data in prepare_backend() (ServiceNow#226)

* AssistantBench + WebLINX fixes (ServiceNow#244)

* Increase assistantbench max_steps to 30

* Setting AssistantBench locale and timezone

* Dedicated AssistantBench action set

* small fix

* missing change

* Lenient frame marking in last retry (ServiceNow#245)

* WA / VWA default action set update (ServiceNow#247)

* version bump v0.13.0

* visualwebarena massage (ServiceNow#248)

* Minor fix (ServiceNow#250)

* Remove gym warnings "obs not within observation space" (ServiceNow#251)

* Lower trace level info -> debug (ServiceNow#252)

* Make env.close() usable after failure (finally block) (ServiceNow#253)

* VWA / WA updates (ServiceNow#254)

* Minor refactors (ServiceNow#255)

* Optional method AbstractBrowserTask.teardown()

* browsergym registration refactor

* Deal with problematic frame unmarking (ServiceNow#256)

* Add missing property exception to _get_obs() retry (ServiceNow#258)

* Bump libwebarena / libvisualwebarena dependencies (ServiceNow#257)

* Massage WebArena instance (ServiceNow#259)

* Refactor AssistantBench output directories (ServiceNow#242)


---------

Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>

* version bump v0.13.1

* Fix broken links

* Update README.md

* fix merging issues

* Update README.md (ServiceNow#270)

* Update README.md

* README update

* More permissive WA/VWA instance reset (ServiceNow#272)

* New debug benchmark visualwebarena_tiny (ServiceNow#271)

* Version bump v0.13.2

* Update README.md

* Metadata column fix (ServiceNow#278)

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Shunt WA / VWA unit tests

* README update

* Minor fixes (ServiceNow#281)

* version bump v0.13.3

* remove unused fluff

* revert more unintended changes

---------

Co-authored-by: Peng Qi <1572802+qipeng@users.noreply.github.com>
Co-authored-by: Thibault LSDC <78021491+ThibaultLSDC@users.noreply.github.com>
Co-authored-by: Maxime Gasse <maxime.gasse@gmail.com>
Co-authored-by: Yanan Xie <yanan@orby.ai>
Co-authored-by: Alexandre Lacoste <alex.lacoste.shmu@gmail.com>
Co-authored-by: oriyor <39461788+oriyor@users.noreply.github.com>
Co-authored-by: Xing Han Lu <21180505+xhluca@users.noreply.github.com>
Co-authored-by: ljang0 <54288880+ljang0@users.noreply.github.com>
Co-authored-by: Megh Thakkar <Megh-Thakkar@users.noreply.github.com>
Co-authored-by: Imene Kerboua <33312980+imenelydiaker@users.noreply.github.com>
Co-authored-by: Oleh Shliazhko <ollmer@users.noreply.github.com>
Co-authored-by: Thibault Le Sellier de Chezelles <thibault.de.chezelles@gmail.com>